[SPARK-38918][SQL] Nested column pruning should filter out attributes that do not belong to the current relation #36216
Conversation
cc @viirya fyi
 * attributes in the schema that do not belong to the current relation.
 */
-case class ProjectionOverSchema(schema: StructType) {
+case class ProjectionOverSchema(schema: StructType, output: Option[AttributeSet] = None) {
We don't always need it? It is `None` by default.
Looks like the `AttributeSet` is required for correctness. If we make it required, can we drop the `fieldNames` var below and just check the attribute set?
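(For reference, a minimal sketch of the attribute check under discussion, assuming the optional `output` signature shown in the diff above; the real `unapply` also rewrites nested extractors like `GetStructField` and `GetArrayStructFields`, which are omitted here:)

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet, Expression}
import org.apache.spark.sql.types.StructType

case class ProjectionOverSchema(schema: StructType, output: Option[AttributeSet] = None) {
  private val fieldNames = schema.fieldNames.toSet

  def unapply(expr: Expression): Option[Expression] = getProjection(expr)

  private def getProjection(expr: Expression): Option[Expression] = expr match {
    case a: AttributeReference
        // Match only attributes that are in the pruned schema AND, when an
        // output set is supplied, belong to the current relation's output.
        if fieldNames.contains(a.name) && output.forall(_.contains(a)) =>
      Some(a.copy(dataType = schema(a.name).dataType)(a.exprId, a.qualifier))
    case _ => None // the real implementation handles nested extractors here
  }
}
```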
|where not exists (select null from employees e where e.name.first = c.name.first
|  and e.employer.name = c.employer.company.name)
|""".stripMargin)
checkAnswer(query, Row(3))
Should we check the pruned schema too?
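(One hedged way to do that, using the suite's `checkScan` helper; the pruned catalog strings below are illustrative guesses based on the fields the query touches, not copied from the merged test:)

```scala
// Illustrative only: assert the pruned read schema of each parquet scan.
// The struct strings are assumptions, not the verified expected values.
checkScan(query,
  "struct<name:struct<first:string>,employer:struct<company:struct<name:string>>>",
  "struct<name:struct<first:string>,employer:struct<name:string>>")
```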
}
}

testSchemaPruning("SPARK-38918: nested schema pruning with correlated subqueries") {
Without this PR, this test failed with `java.lang.RuntimeException: Once strategy's idempotence is broken for batch RewriteSubquery`.
Yes, it looks like a separate issue with column pruning and subquery rewrite (data source v1 only). I will investigate more.
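(For readability, a hypothetical reconstruction of the full test body inside `SchemaPruningSuite`, stitched from the fragments quoted above; the `select count(*)` line and the `contacts`/`employees` table names are assumptions based on the suite's fixtures and the plan log below:)

```scala
testSchemaPruning("SPARK-38918: nested schema pruning with correlated subqueries") {
  // `contacts` and `employees` are the suite's existing nested-schema tables (assumed).
  val query = sql(
    """select count(*) from contacts c
      |where not exists (select null from employees e where e.name.first = c.name.first
      |  and e.employer.name = c.employer.company.name)
      |""".stripMargin)
  checkAnswer(query, Row(3))
}
```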
viirya left a comment:
This fix looks good, with a few comments.
 override protected val excludedOnceBatches: Set[String] =
   Set(
     "PartitionPruning",
+    "RewriteSubquery",
I discovered that this `Once` batch is not idempotent. `ColumnPruning` and `CollapseProject` can be applied multiple times after correlated IN/EXISTS subqueries are rewritten. Happy to discuss other ways to fix/improve this batch. cc @cloud-fan
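(For context: Catalyst's `RuleExecutor` verifies a `Once` batch by applying it a second time and failing if the plan changes again, unless the batch is listed in `excludedOnceBatches`. A simplified paraphrase of that check, not the exact Spark source:)

```scala
// Simplified paraphrase of RuleExecutor's Once-batch idempotence check.
// `Batch` and `TreeType` are RuleExecutor's own types.
private def checkBatchIdempotence(batch: Batch, plan: TreeType): Unit = {
  // Re-apply every rule in the batch once more; a Once batch must be a fixed point.
  val reOptimized = batch.rules.foldLeft(plan) { case (p, rule) => rule(p) }
  if (!plan.fastEquals(reOptimized)) {
    throw new RuntimeException(
      s"Once strategy's idempotence is broken for batch ${batch.name}")
  }
}
```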
Attached the plan change log for the test case:
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
! +- Filter NOT exists#157 [name#117 && employer#122 && (name#152.first = name#117.first) && (employer#153.name = employer#122.company.name)] +- Join LeftAnti, ((name#152.first = name#117.first) AND (employer#153.name = employer#122.company.name))
! : +- Project [null AS NULL#163, name#152, employer#153] :- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! : +- Relation [id#151,name#152,employer#153] parquet +- Project [null AS NULL#163, name#152, employer#153]
! +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet +- Relation [id#151,name#152,employer#153] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
! +- Join LeftAnti, ((name#152.first = name#117.first) AND (employer#153.name = employer#122.company.name)) +- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169))
! :- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet :- Project [id#116, name#117.first AS _extract_first#167, address#118, pets#119, friends#120, relatives#121, employer#122.company.name AS _extract_name#169, relations#123, p#124]
! +- Project [null AS NULL#163, name#152, employer#153] : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! +- Relation [id#151,name#152,employer#153] parquet +- Project [_extract_first#166, _extract_name#168]
! +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168]
! +- Project [name#152, employer#153]
! +- Relation [id#151,name#152,employer#153] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
+- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169)) +- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169))
:- Project [id#116, name#117.first AS _extract_first#167, address#118, pets#119, friends#120, relatives#121, employer#122.company.name AS _extract_name#169, relations#123, p#124] :- Project [id#116, name#117.first AS _extract_first#167, address#118, pets#119, friends#120, relatives#121, employer#122.company.name AS _extract_name#169, relations#123, p#124]
: +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! +- Project [_extract_first#166, _extract_name#168] +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168]
! +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168] +- Relation [id#151,name#152,employer#153] parquet
! +- Project [name#152, employer#153]
! +- Relation [id#151,name#152,employer#153] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.ColumnPruning ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
+- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169)) +- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169))
! :- Project [id#116, name#117.first AS _extract_first#167, address#118, pets#119, friends#120, relatives#121, employer#122.company.name AS _extract_name#169, relations#123, p#124] :- Project [_extract_first#167, _extract_name#169]
! : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet : +- Project [name#117.first AS _extract_first#167, employer#122.company.name AS _extract_name#169]
! +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168] : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! +- Relation [id#151,name#152,employer#153] parquet +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168]
! +- Relation [id#151,name#152,employer#153] parquet
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.CollapseProject ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
+- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169)) +- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169))
! :- Project [_extract_first#167, _extract_name#169] :- Project [name#117.first AS _extract_first#167, employer#122.company.name AS _extract_name#169]
! : +- Project [name#117.first AS _extract_first#167, employer#122.company.name AS _extract_name#169] : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168]
! +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168] +- Relation [id#151,name#152,employer#153] parquet
! +- Relation [id#151,name#152,employer#153] parquet
=== Result of Batch RewriteSubquery ===
Aggregate [count(1) AS count(1)#164L] Aggregate [count(1) AS count(1)#164L]
+- Project +- Project
! +- Filter NOT exists#157 [name#117 && employer#122 && (name#152.first = name#117.first) && (employer#153.name = employer#122.company.name)] +- Join LeftAnti, ((_extract_first#166 = _extract_first#167) AND (_extract_name#168 = _extract_name#169))
! : +- Project [null AS NULL#163, name#152, employer#153] :- Project [name#117.first AS _extract_first#167, employer#122.company.name AS _extract_name#169]
! : +- Relation [id#151,name#152,employer#153] parquet : +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet
! +- Relation [id#116,name#117,address#118,pets#119,friends#120,relatives#121,employer#122,relations#123,p#124] parquet +- Project [name#152.first AS _extract_first#166, employer#153.name AS _extract_name#168]
!
We didn't find this before because we didn't have the test coverage?
Hmm, there seems to be a related test failure:
viirya left a comment:
Looks good. Just want to make sure the `excludedOnceBatches` change is not caused by this.
@viirya That's correct. It is not caused by this PR. The new test case happens to expose the idempotency issue that was not discovered before.
@allisonwang-db I saw you only list …
@viirya Yes! This fix also needs to be in 3.0/3.1/3.2.
Thanks. Merging to master/3.3.
… that do not belong to the current relation

### What changes were proposed in this pull request?
This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query data schema, thus causing wrong results and unexpected exceptions.

### Why are the changes needed?
To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #36216 from allisonwang-db/spark-38918-nested-column-pruning.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 150434b)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@allisonwang-db There are conflicts in 3.2/3.1/3.0. Can you create separate PR(s) for them?
… that do not belong to the current relation

This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query data schema, thus causing wrong results and unexpected exceptions.

To fix a bug in `SchemaPruning`.

No

Unit test

Closes apache#36216 from allisonwang-db/spark-38918-nested-column-pruning.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 150434b)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 793ba60)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>
…butes that do not belong to the current relation

### What changes were proposed in this pull request?
Backport #36216 to branch-3.1.

### Why are the changes needed?
To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #36387 from allisonwang-db/spark-38918-branch-3.1.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…butes that do not belong to the current relation

### What changes were proposed in this pull request?
Backport #36216 to branch-3.0.

### Why are the changes needed?
To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #36388 from allisonwang-db/spark-38918-branch-3.0.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
…butes that do not belong to the current relation

### What changes were proposed in this pull request?
Backport #36216 to branch-3.2.

### Why are the changes needed?
To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #36386 from allisonwang-db/spark-38918-branch-3.2.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR updates `ProjectionOverSchema` to use the outputs of the data source relation to filter the attributes in the nested schema pruning. This is needed because the attributes in the schema do not necessarily belong to the current data source relation. For example, if a filter contains a correlated subquery, then the subquery's children can contain attributes from both the inner query and the outer query. Since the `RewriteSubquery` batch happens after the early scan pushdown rules, nested schema pruning can wrongly use the inner query's attributes to prune the outer query's data schema, causing wrong results and unexpected exceptions.
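(A hedged sketch of the corresponding call-site change in the early scan pushdown rule; the variable names and the `Some(...)` wrapping follow the optional signature shown in the review diff, and are assumptions rather than the merged code:)

```scala
// Thread the relation's own output into ProjectionOverSchema so attributes
// from a correlated subquery's inner plan can no longer match (assumed shape).
val projectionOverSchema =
  ProjectionOverSchema(prunedDataSchema, Some(AttributeSet(relation.output)))
```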
### Why are the changes needed?
To fix a bug in `SchemaPruning`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test